Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle this diversity while also following arbitrary language instructions. Recently, vision-language models like CLIP have shown strong performance on zero-shot object recognition. In this work, we ask whether these models are also capable of zero-shot language grounding. In particular, we use CLIP to tackle the novel problem of zero-shot VLN using natural-language referring expressions that describe target objects, in contrast to past work that used simple language templates describing object classes. We examine CLIP's ability to make sequential navigational decisions without any dataset-specific finetuning and study how it influences the path that an agent takes. Our results on REVERIE, a coarse-grained instruction-following task, demonstrate the navigational capability of CLIP, surpassing the supervised baseline in both success rate (SR) and success weighted by path length (SPL). More importantly, we quantitatively show that our CLIP-based zero-shot approach generalizes better, delivering consistent performance across environments compared to SOTA fully supervised approaches when evaluated via Relative Change in Success (RCS).
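As a rough illustration of the zero-shot decision step, the sketch below scores candidate viewpoint images against a referring expression with an off-the-shelf CLIP model and picks the best-matching view; the selection loop and function names are our own illustration, not the authors' released code.

```python
# Illustrative sketch: choose the next viewpoint by CLIP similarity between
# candidate view images and the referring expression (not the paper's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_next_viewpoint(instruction: str, candidate_images: list[Image.Image]) -> int:
    """Return the index of the candidate view that best matches the instruction."""
    inputs = processor(text=[instruction], images=candidate_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_candidates, 1): image-text similarity scores.
    scores = outputs.logits_per_image.squeeze(-1)
    return int(scores.argmax().item())
```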
Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based only on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approach consists of an off-the-shelf semantic segmentation model, a voxel-based semantic map, and a semantic search policy to efficiently find the objects that need to be rearranged. On the AI2 Rearrangement Challenge, our method improves over current state-of-the-art end-to-end reinforcement learning methods that learn visual rearrangement policies, from 0.53% correct rearrangement to 16.56%, while using only 2.7% of the samples from the environment.
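The sketch below shows, under our own simplifying assumptions, how a voxel-based semantic map could be diffed between the goal and shuffle phases to flag objects that need rearranging; the data structure and threshold are illustrative, not the paper's implementation.

```python
# Minimal sketch of a voxel-based semantic map and a map diff that flags
# object classes whose occupied voxels changed between the two phases.
from collections import defaultdict

import numpy as np

class VoxelSemanticMap:
    def __init__(self, voxel_size: float = 0.25):
        self.voxel_size = voxel_size
        self.voxels = defaultdict(set)  # object class -> set of voxel indices

    def add_points(self, label: str, points_xyz: np.ndarray) -> None:
        """Register 3D points (N, 3) predicted for one semantic class."""
        idx = np.floor(points_xyz / self.voxel_size).astype(int)
        self.voxels[label].update(map(tuple, idx))

def objects_to_rearrange(goal: VoxelSemanticMap, current: VoxelSemanticMap,
                         min_changed_voxels: int = 3) -> list[str]:
    """Return classes whose voxel footprint differs enough between the maps."""
    labels = set(goal.voxels) | set(current.voxels)
    changed = []
    for label in labels:
        diff = goal.voxels[label] ^ current.voxels[label]  # symmetric difference
        if len(diff) >= min_changed_voxels:
            changed.append(label)
    return changed
```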
Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. Most such scenes are not particularly exciting, and they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation, from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents, and over 15% of the videos have more than one person. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.
Understanding deep learning model behavior is critical to accepting machine learning-based decision support systems in the medical community. Previous research has shown that jointly using clinical notes with electronic health record (EHR) data improved predictive performance for patient monitoring in the intensive care unit (ICU). In this work, we explore the underlying reasons for these improvements. While relying on a basic attention-based model to allow for interpretability, we first confirm that performance significantly improves over state-of-the-art EHR data models when combining EHR data and clinical notes. We then provide an analysis showing that the improvements arise almost exclusively from a subset of notes containing broader context on patient state rather than from clinician notes. We believe such findings highlight that deep learning models for EHR data are more limited by partially-descriptive data than by modeling choice, motivating a more data-centric approach in the field.
Models that can predict adverse events ahead of time with a low false-alarm rate are critical to the acceptance of decision support systems in the medical community. This challenging machine learning task is typically still treated as simple binary classification, with few bespoke methods proposed to leverage temporal dependency across samples. We propose Temporal Label Smoothing (TLS), a novel learning strategy that modulates smoothing strength as a function of proximity to the event of interest. This regularization technique reduces model confidence at the class boundary, where the signal is often noisy or uninformative, so that training can focus on clinically informative data points away from this boundary region. From a theoretical perspective, we also show that our method can be framed as an extension of multi-horizon prediction, a learning heuristic proposed in other early-prediction work. TLS empirically matches or outperforms the considered competing methods on various early prediction benchmark tasks. In particular, our approach significantly improves performance on clinically relevant metrics such as event recall at low false-alarm rates.
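A minimal sketch of the idea, assuming an exponential smoothing schedule (the actual schedule and parameterization used by TLS may differ): targets are softened as a function of time to the event and then used with a standard cross-entropy loss against soft labels.

```python
# Illustrative temporal label smoothing: confidence in the positive label
# decays with distance to the event, softening targets near the class boundary.
import numpy as np

def temporal_label_smoothing(time_to_event: np.ndarray,
                             horizon: float,
                             rate: float = 2.0) -> np.ndarray:
    """Return soft targets in [0, 1] for each time step.

    time_to_event : hours until the adverse event (np.inf if it never occurs)
    horizon       : prediction horizon that defines the positive class
    rate          : decay rate of the (assumed) exponential smoothing schedule
    """
    targets = np.zeros_like(time_to_event, dtype=float)
    finite = np.isfinite(time_to_event)
    # Close to the event -> target near 1; near and beyond the horizon -> softer.
    targets[finite] = np.exp(-rate * time_to_event[finite] / horizon)
    return targets
```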
Predicting students' academic performance is one of the key tasks of educational data mining (EDM). Traditionally, the high predictive quality of such models was considered crucial. More recently, fairness and discrimination w.r.t. protected attributes (e.g., gender or race) have gained attention. Although several fairness-aware learning approaches exist in EDM, a comparative evaluation of fairness measures is still missing. In this paper, we evaluate different group fairness measures for the student performance prediction problem on various educational datasets and fairness-aware learning models. Our study shows that the choice of the fairness measure is important, and likewise the choice of the grade threshold.
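For concreteness, the sketch below computes two group fairness measures commonly used in such comparisons, statistical parity difference and equal opportunity difference; the exact set of measures and sign conventions evaluated in the paper may differ.

```python
# Illustrative group fairness measures for a binary predictor and a binary
# protected attribute (0/1 group encoding assumed).
import numpy as np

def statistical_parity_difference(y_pred, protected):
    """P(pred = 1 | protected = 1) - P(pred = 1 | protected = 0)."""
    y_pred, protected = np.asarray(y_pred), np.asarray(protected)
    return y_pred[protected == 1].mean() - y_pred[protected == 0].mean()

def equal_opportunity_difference(y_true, y_pred, protected):
    """Difference in true-positive rates between the two groups."""
    y_true, y_pred, protected = map(np.asarray, (y_true, y_pred, protected))
    tpr = lambda g: y_pred[(protected == g) & (y_true == 1)].mean()
    return tpr(1) - tpr(0)
```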
Group work is a prevalent activity in educational settings, where students are often divided into topic-specific groups based on their preferences. The grouping should reflect the students' wishes as much as possible. Since research has shown that students may learn better in diverse groups, the resulting groups should usually also be balanced with respect to protected attributes such as gender or race. Moreover, balanced group cardinalities are an essential requirement for a fair workload distribution across groups. In this paper, we introduce the multi-fair capacitated (MFC) grouping problem, which fairly assigns students to non-overlapping groups while ensuring balanced group cardinalities (with a lower and an upper bound) and maximizing the diversity of members with respect to protected attributes. We propose two approaches, a heuristic method and a knapsack-based method, to obtain the MFC grouping. Experiments on a real dataset and a semi-synthetic dataset show that our proposed methods satisfy students' preferences well and deliver groups that are balanced and diverse with respect to cardinality and protected attributes, respectively.
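A minimal greedy sketch of the capacitated, diversity-aware assignment idea (not the paper's heuristic or knapsack formulation): each student is placed into a feasible preferred group where their protected-attribute value is currently rarest, so preferences, capacity, and diversity are traded off in a single pass.

```python
# Illustrative greedy grouping; assumes n_groups * upper_bound covers all students.
from collections import Counter

def greedy_mfc_grouping(preferences, protected, n_groups, upper_bound):
    """preferences: ranked group ids per student; protected: attribute value per student."""
    groups = {g: [] for g in range(n_groups)}
    attr_counts = {g: Counter() for g in range(n_groups)}
    for student, prefs in enumerate(preferences):
        # Feasible = preferred groups with room, falling back to any group with room.
        feasible = [g for g in prefs if len(groups[g]) < upper_bound] or \
                   [g for g in range(n_groups) if len(groups[g]) < upper_bound]
        # Prefer the group where this student's attribute value is rarest (diversity).
        best = min(feasible, key=lambda g: attr_counts[g][protected[student]])
        groups[best].append(student)
        attr_counts[best][protected[student]] += 1
    return groups
```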
Interpretable machine learning (IML) is concerned with the behavior and properties of machine learning models. Scientists, however, are only interested in a model as a gateway to understanding the modeled phenomenon. We show how to develop IML methods such that they allow insight into relevant properties of the phenomenon. We argue that current IML research conflates two goals of model analysis: model audit and scientific inference. Thereby, it remains unclear whether model interpretations have corresponding phenomenon interpretations. Building on statistical decision theory, we show that ML model analysis allows us to describe relevant aspects of the joint data probability distribution. We provide a five-step framework for constructing IML descriptors that can help answer scientific questions, including a natural way to quantify epistemic uncertainty. Our phenomenon-centric approach to IML in science clarifies the opportunities and limitations of IML for inference, the need for conditional rather than marginal sampling, and the conditions under which we can trust IML methods.
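To make the conditional-versus-marginal point concrete, the sketch below contrasts a global shuffle (marginal sampling) with shuffling within bins of a correlated covariate, a crude stand-in for sampling from the conditional distribution; the binning approximation is our own simplification, not the framework proposed in the paper.

```python
# Illustrative marginal vs. (approximately) conditional resampling of a feature,
# as would be used inside a permutation-style IML descriptor.
import numpy as np

def marginal_permutation(x_col: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample the feature from its marginal distribution via a global shuffle."""
    return rng.permutation(x_col)

def conditional_permutation(x_col: np.ndarray, conditioner: np.ndarray,
                            n_bins: int, rng: np.random.Generator) -> np.ndarray:
    """Shuffle the feature only within quantile bins of a correlated covariate."""
    edges = np.quantile(conditioner, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(conditioner, edges)
    out = x_col.copy()
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        out[idx] = rng.permutation(x_col[idx])
    return out
```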
Clinical text notes (CTNs) contain the reasoning process of a physician, written in unstructured free-text format as they examine and interview patients. In recent years, several studies have been published providing evidence for the utility of machine learning in predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time-consuming, especially when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of unannotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find answers to during a patient consultation. The features are then used to train a classifier for diagnosing certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of data availability to the physician. Our data augmentation method shows a significant positive effect, which diminishes when clinical features from patient examinations and diagnostics are included. We recommend our method for augmenting scarce datasets for systems that make decisions based on clinical features that do not include examinations or tests.
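A minimal sketch of the semi-self-supervised pipeline under simplifying assumptions (TF-IDF features and logistic regression stand in for the paper's neural network, and ICD diagnosis labels are assumed available for the larger note set): a feature extractor trained on the small annotated set pseudo-labels clinical features for unannotated notes, which then train a diagnosis classifier.

```python
# Illustrative pipeline: annotate -> pseudo-label clinical features -> diagnose.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

def augment_and_train(annotated_notes, feature_matrix,
                      unannotated_notes, diagnoses_large):
    vectorizer = TfidfVectorizer(max_features=20000)
    X_small = vectorizer.fit_transform(annotated_notes)

    # Step 1: learn to predict clinical features from the small annotated set.
    feature_model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
    feature_model.fit(X_small, feature_matrix)

    # Step 2: pseudo-label clinical features for the unannotated notes.
    X_large = vectorizer.transform(unannotated_notes)
    pseudo_features = feature_model.predict(X_large)

    # Step 3: train the diagnosis classifier on the augmented feature set.
    diagnosis_model = LogisticRegression(max_iter=1000)
    diagnosis_model.fit(pseudo_features, diagnoses_large)
    return feature_model, diagnosis_model
```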
Data augmentation is commonly applied to improve the performance of deep learning by enforcing the knowledge that certain transformations of the input preserve the output. Currently, the data augmentation to use is chosen through human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data and during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work for Gaussian processes but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimized without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalization and data efficiency on image datasets.
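The sketch below only illustrates the differentiable-augmentation ingredient: predictions are averaged over transformations sampled from a distribution whose parameter (here a rotation range) receives gradients through a differentiable warp. The paper's actual objective, a Kronecker-factored Laplace approximation to the marginal likelihood, is not reproduced here.

```python
# Illustrative differentiable rotation augmentation whose range parameter is
# learnable; max_angle is a tensor with requires_grad=True in practice.
import torch
import torch.nn.functional as F

def rotate_differentiable(x, angle):
    """Rotate a batch of images (B, C, H, W) by `angle` radians, keeping gradients."""
    cos, sin = torch.cos(angle), torch.sin(angle)
    theta = torch.stack([
        torch.stack([cos, -sin, torch.zeros_like(cos)]),
        torch.stack([sin,  cos, torch.zeros_like(cos)]),
    ]).unsqueeze(0).expand(x.size(0), -1, -1)          # (B, 2, 3) affine matrices
    grid = F.affine_grid(theta, x.size(), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def averaged_logits(model, x, max_angle, n_samples=8):
    """Average logits over rotations drawn from U(-max_angle, max_angle)."""
    out = 0.0
    for _ in range(n_samples):
        angle = (2 * torch.rand(()) - 1) * max_angle    # reparameterized sample
        out = out + model(rotate_differentiable(x, angle))
    return out / n_samples
```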